The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games such as chess or esports. It is named after its creator Arpad Elo, a Hungarian-American chess master and physics professor.
The Elo system was invented as an improved chess-rating system over the previously used Harkness system, but is also used as a rating system in association football (soccer), American football, baseball, basketball, pool, various other games and esports, and, more recently, large language models.
The difference in the ratings between two players serves as a predictor of the outcome of a match. Two players with equal ratings who play against each other are expected to score an equal number of wins. A player whose rating is 100 points greater than their opponent's is expected to score 64%; if the difference is 200 points, then the expected score for the stronger player is 76% (using the formula E = \frac{1}{1 + 10^{-\Delta/400}} with the rating difference \Delta equal to 100 or 200).
A player's Elo rating is a number that may change depending on the outcome of rated games played. After every game, the winning player takes points from the losing one. The difference between the ratings of the winner and loser determines the total number of points gained or lost after a game. If the higher-rated player wins, then only a few rating points will be taken from the lower-rated player. However, if the lower-rated player scores an upset win, many rating points will be transferred. The lower-rated player will also gain a few points from the higher rated player in the event of a draw. This means that this rating system is self-correcting. Players whose ratings are too low or too high should, in the long run, do better or worse correspondingly than the rating system predicts and thus gain or lose rating points until the ratings reflect their true playing strength.
Elo ratings are comparative only, and are valid only within the rating pool in which they were calculated, rather than being an absolute measure of a player's strength.
While Elo-like systems are widely used in two-player settings, variations have also been applied to multiplayer competitions (see, e.g., "Elo-MMR: A Rating System for Massive Multiplayer Competitions").
On behalf of the USCF, Elo devised a new system with a more sound statistical basis. At about the same time, György Karoly and Roger Cook independently developed a system based on the same principles for the New South Wales Chess Association (Elo 1986, p. 4).
Elo's system replaced earlier systems of competitive rewards with one based on statistical estimation. Rating systems for many sports award points in accordance with subjective evaluations of the 'greatness' of certain achievements. For example, winning an important golf tournament might be worth five times as many points as winning a lesser tournament, with the multiplier chosen arbitrarily.
A statistical endeavor, by contrast, uses a model that relates the game results to underlying variables representing the ability of each player.
Elo's central assumption was that the chess performance of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Elo assumed that the mean value of the performances of any given player changes only slowly over time. Elo thought of a player's true skill as the mean of that player's performance random variable.
A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill. Performance can only be inferred from wins, draws, and losses. Therefore, a player who wins a game is assumed to have performed at a higher level than the opponent for that game. Conversely, a losing player is assumed to have performed at a lower level. If the game ends in a draw, the two players are assumed to have performed at nearly the same level.
Elo did not specify exactly how close two performances ought to be to result in a draw as opposed to a win or loss. In reality, the probability of a draw depends on the performance differential, so this boundary is better understood as a region of uncertainty than as a deterministic frontier. And while he thought it was likely that players might have different standard deviations to their performances, he made a simplifying assumption to the contrary.
To simplify computation even further, Elo proposed a straightforward method of estimating the variables in his model (i.e., the true skill of each player). One could calculate relatively easily from tables how many games players would be expected to win based on comparisons of their ratings to those of their opponents. The ratings of a player who won more games than expected would be adjusted upward, while those of a player who won fewer than expected would be adjusted downward. Moreover, that adjustment was to be in linear proportion to the number of wins by which the player had exceeded or fallen short of their expected number.
From a modern perspective, Elo's simplifying assumptions are not necessary because computing power is inexpensive and widely available. Several people, most notably Mark Glickman, have proposed using more sophisticated statistical machinery to estimate the same variables. On the other hand, the computational simplicity of the Elo system has proven to be one of its greatest assets. With the aid of a pocket calculator, an informed chess competitor can calculate to within one point what their next officially published rating will be, which helps promote a perception that the ratings are fair.
Subsequent statistical tests have suggested that chess performance is almost certainly not distributed as a normal distribution, as weaker players have greater winning chances than Elo's model predicts (Elo 1986, ch. 8.73; Glickman, Mark E., and Jones, Albyn C. (1999), Chance, 12(2), 21–28, http://www.glicko.net/research/chance.pdf). In paired comparison data, there is often very little practical difference in whether it is assumed that the differences in players' strengths are normally or logistically distributed. Mathematically, however, the logistic function is more convenient to work with than the normal distribution (Glickman, Mark E. (1995), http://www.glicko.net/research/acjpaper.pdf; a subsequent version of this paper appeared in the American Chess Journal, 3, pp. 59–102). FIDE continues to use the rating difference table as proposed by Elo.
The development of the Percentage Expectancy Table (table 2.11) is described in more detail by Elo as follows (Elo 1986, p. 159):
The normal probabilities may be taken directly from the standard tables of the areas under the normal curve when the difference in rating is expressed as a z score. Since the standard deviation σ of individual performances is defined as 200 points, the standard deviation σ' of the differences in performances becomes σ√2 or 282.84. The z value of a rating difference D is then z = D/282.84. This will then divide the area under the curve into two parts, the larger giving P for the higher rated player and the smaller giving P for the lower rated player. For example, let D = 160. Then z = 160/282.84 = 0.566. The table gives 0.714 and 0.286 as the areas of the two portions under the curve. These probabilities are rounded to two figures in table 2.11.
The table is actually built with standard deviation as an approximation for .
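The percentage-expectancy values can be reproduced numerically. The sketch below assumes the normal model described above, with per-game standard deviation σ = 200 so that the difference of two performances has standard deviation σ√2 ≈ 282.84; the function name and the printed rating differences are illustrative only.

```python
import math

def normal_expectancy(rating_diff, sigma=200.0):
    """Probability that the higher-rated player outperforms the lower-rated one
    under Elo's normal model: each performance ~ N(rating, sigma^2), so the
    difference of two performances has standard deviation sigma * sqrt(2)."""
    z = rating_diff / (sigma * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF

# Reproduces the figures quoted earlier: about 0.64 at +100, 0.71 at +160, 0.76 at +200.
for d in (100, 160, 200):
    print(d, round(normal_expectancy(d), 2))
```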
The normal and logistic distributions are, in a way, arbitrary points in a spectrum of distributions which would work well. In practice, both of these distributions work very well for a number of different games.
Instead of quoting a single universal number, one may refer to the organization granting the rating. For example: "As of April 2018, Tatev Abrahamyan had a FIDE rating of 2366 and a USCF rating of 2473." The Elo ratings of these various organizations are not directly comparable, since Elo ratings measure the results within a closed pool of players rather than absolute skill.
The following analysis of the July 2015 FIDE rating list gives a rough impression of what a given FIDE rating means in terms of world ranking:
The highest ever FIDE rating was 2882, which Magnus Carlsen had on the May 2014 list. A list of the highest-rated players ever is at Comparison of top chess players throughout history.
Rating difference dp as a function of the percentage score p (selected values from the lookup table):

p = 1.00 → dp = +800
p = 0.99 → dp = +677
p = 0.90 → dp = +366
p = 0.80 → dp = +240
p = 0.70 → dp = +149
p = 0.60 → dp = +72
p = 0.50 → dp = 0
p = 0.40 → dp = −72
p = 0.30 → dp = −149
p = 0.20 → dp = −240
p = 0.10 → dp = −366
p = 0.01 → dp = −677
p = 0.00 → dp = −800
Example: 2 wins and 2 losses against four rated opponents.
This is a simplification, but it offers an easy way to get an estimate of PR (performance rating).
FIDE, however, calculates performance rating by means of the formula: performance rating = average rating of opponents + rating difference dp, where the "rating difference" dp is based on a player's tournament percentage score p, which is then used as the key in a lookup table, and where p is simply the number of points scored divided by the number of games played. Note that, in case of a perfect or zero score, dp is ±800.
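As a sketch of the simplified estimate mentioned above, one common linear rule averages the opponents' ratings and adds 400 points per win and subtracts 400 per loss, divided by the number of games; the opponent ratings below are invented for illustration, and this is an approximation rather than FIDE's table-based method.

```python
def linear_performance_rating(opponent_ratings, wins, losses):
    """Simplified performance rating: average opponent rating
    plus 400 * (wins - losses) / number of games played."""
    games = len(opponent_ratings)
    return sum(opponent_ratings) / games + 400 * (wins - losses) / games

# Hypothetical example: 2 wins and 2 losses against four opponents.
print(linear_performance_rating([1800, 1650, 1900, 1700], wins=2, losses=2))  # 1762.5
```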
Although live ratings are unofficial, interest arose in them in August/September 2008 when five different players took the "Live" No. 1 ranking: Anand lost the No. 1 spot to Morozevich (Chessbase, August 24, 2008), then regained it, then Carlsen took No. 1 (Chessbase, September 5, 2008), then Ivanchuk (Chessbase, September 11, 2008), and finally Topalov (Chessbase, September 13, 2008).
The unofficial live ratings of players over 2700 were published and maintained by Hans Arild Runde at the Live Rating website until August 2011. Another website, 2700chess.com, has been maintained since May 2011 by Artiom Tsepotan, which covers the top 100 players as well as the top 50 female players.
Rating changes can be calculated manually by using the FIDE ratings change calculator. All top players have a K-factor of 10, which means that the maximum ratings change from a single game is a little less than 10 points.
A player's absolute rating floor is calculated as

AF = \min(100 + 4N_W + 2N_D + N_R,\ 150),

where N_W is the number of rated games won, N_D is the number of rated games drawn, and N_R is the number of events in which the player completed three or more rated games.
Higher rating floors exist for experienced players who have achieved significant ratings, starting at 1200 and rising in 100-point increments up to 2100 (1200, 1300, 1400, ..., 2100). A rating floor is calculated by taking the player's peak established rating, subtracting 200 points, and then rounding down to the nearest rating floor. For example, a player who has reached a peak rating of 1464 would have a rating floor of 1464 − 200 = 1264, which would be rounded down to 1200. Under this scheme, only Class C players and above are capable of having a higher rating floor than their absolute player rating. All other players would have a floor of at most 150.
There are two ways to achieve higher rating floors other than under the standard scheme presented above. If a player has achieved the rating of Original Life Master, their rating floor is set at 2200. The achievement of this title is unique in that no other recognized USCF title will result in a new floor. For players with ratings below 2000, winning a cash prize of $2,000 or more raises that player's rating floor to the closest 100-point level that would have disqualified the player from participation in the tournament. For example, if a player won $4,000 in a 1750-and-under tournament, they would now have a rating floor of 1800.
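A minimal sketch of the standard floor scheme described above (peak established rating minus 200, rounded down to the nearest 100-point floor between 1200 and 2100); it deliberately ignores the Original Life Master and prize-based floors, and the function name is illustrative.

```python
def standard_rating_floor(peak_established_rating):
    """Higher rating floor under the standard scheme: peak minus 200, rounded
    down to the nearest 100, and confined to the 1200-2100 range."""
    floor = (peak_established_rating - 200) // 100 * 100
    if floor < 1200:
        return None  # no higher floor; only the absolute floor (at most 150) applies
    return min(floor, 2100)

print(standard_rating_floor(1464))  # 1200, as in the example above
print(standard_rating_floor(2350))  # 2100
```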
A player's expected score is their probability of winning plus half their probability of drawing. Thus, an expected score of 0.75 could represent a 75% chance of winning, 25% chance of losing, and 0% chance of drawing. At the other extreme, it could represent a 50% chance of winning, 0% chance of losing, and 50% chance of drawing. The probability of drawing, as opposed to having a decisive result, is not specified in the Elo system. Instead, a draw is considered half a win and half a loss. In practice, since the true strength of each player is unknown, the expected scores are calculated using the player's current ratings as follows.
If player A has a rating of R_A and player B a rating of R_B, the exact formula (using the logistic curve with the common logarithm) for the expected score of player A is (Elo 1986, ch. 8.4 "Logistic probability as a rating basis", p. 141)

E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}.
Similarly, the expected score for player B is

E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}}.
This could also be expressed by

E_A = \frac{Q_A}{Q_A + Q_B}

and

E_B = \frac{Q_B}{Q_A + Q_B},

where Q_A = 10^{R_A/400} and Q_B = 10^{R_B/400}. Note that in the latter case, the same denominator applies to both expressions, and it is plain that E_A + E_B = 1. This means that by studying only the numerators, we find out that the expected score for player A is Q_A/Q_B times the expected score for player B. This can also be seen algebraically: subtracting 1 from the reciprocal of E_B gives Q_A/Q_B, and multiplying E_B by that ratio yields E_A. It then follows that for each 400 rating points of advantage over the opponent, the expected score is magnified ten times in comparison to the opponent's expected score.
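The expected scores are easy to evaluate directly; the snippet below computes them both from the logistic formula and from the Q_A, Q_B form to show that the two agree and sum to one. The ratings used are arbitrary.

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B (logistic form, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def expected_score_q(r_a, r_b):
    """Same quantity via Q_A = 10^(R_A/400) and Q_B = 10^(R_B/400)."""
    q_a, q_b = 10 ** (r_a / 400.0), 10 ** (r_b / 400.0)
    return q_a / (q_a + q_b)

e_a, e_b = expected_score(1500, 1400), expected_score(1400, 1500)
print(round(e_a, 3), round(expected_score_q(1500, 1400), 3))  # identical: 0.64
print(e_a + e_b)  # 1.0
```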
When a player's actual tournament scores exceed their expected scores, the Elo system takes this as evidence that the player's rating is too low and needs to be adjusted upward. Similarly, when a player's actual tournament scores fall short of their expected scores, that player's rating is adjusted downward. Elo's original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player over-performed or under-performed their expected score. The maximum possible adjustment per game, called the K-factor, was set at K = 16 for masters and K = 32 for weaker players.
Suppose player A (again with rating R_A) was expected to score E_A points but actually scored S_A points. The formula for updating that player's rating is

R_A' = R_A + K(S_A - E_A).
This update can be performed after each game or each tournament, or after any suitable rating period.
An example may help to clarify:
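Since the original worked example is not reproduced here, the following sketch plays through a hypothetical five-game rating period with K = 32; all opponent ratings and results are invented for illustration.

```python
K = 32  # illustrative K-factor

def expected(r_player, r_opponent):
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400.0))

rating = 1500.0
# (opponent rating, score) with score 1 = win, 0.5 = draw, 0 = loss
games = [(1400, 1.0), (1550, 0.5), (1700, 0.0), (1450, 1.0), (1600, 1.0)]

actual = sum(score for _, score in games)                        # 3.5 points
expected_total = sum(expected(rating, opp) for opp, _ in games)  # about 2.24

# One adjustment for the whole rating period.
new_rating = rating + K * (actual - expected_total)
print(round(expected_total, 2), round(new_rating, 1))  # 2.24, 1540.3
```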
This updating procedure is at the core of the ratings used by FIDE, USCF, Yahoo! Games, the Internet Chess Club (ICC) and the Free Internet Chess Server (FICS). However, each organization has taken a different approach to dealing with the uncertainty inherent in the ratings, particularly the ratings of newcomers, and to dealing with the problem of ratings inflation/deflation. New players are assigned provisional ratings, which are adjusted more drastically than established ratings.
The principles used in these rating systems can be used for rating other competitions—for instance, international football matches.
Elo ratings have also been applied to games without the possibility of draws, and to games in which the result can also have a quantity (small/big margin) in addition to the quality (win/loss). See Go rating with Elo for more.
Elo's original K-factor estimation was made without the benefit of huge databases and statistical evidence. Sonas indicates that a K-factor of 24 (for players rated above 2400) may be both more accurate as a predictive tool of future performance and more sensitive to performance.
Certain Internet chess sites seem to avoid a three-level K-factor staggering based on rating range. For example, the ICC seems to adopt a single global K-factor except when playing against provisionally rated players.
The USCF (which makes use of a logistic distribution as opposed to a normal distribution) formerly staggered the K-factor according to three main rating ranges:
K = 32, for players rated below 2100
K = 24, for players rated between 2100 and 2400
K = 16, for players rated above 2400
Currently, the USCF uses a formula that calculates the K-factor based on factors including the number of games played and the player's rating. The K-factor is also reduced for highly rated players if the event has shorter time controls.
FIDE uses the following ranges:
K = 40, for a player new to the rating list until the completion of events with a total of 30 games, and for all players until their 18th birthday, as long as their rating remains under 2300.
K = 20, for players who have always been rated under 2400.
K = 10, for players with any published rating of at least 2400 and at least 30 games played in previous events. Thereafter it remains permanently at 10.
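A sketch of how the current FIDE ranges above could be expressed in code; the eligibility conditions (game counts, age, whether a published rating of 2400 was ever reached) are reduced to simple inputs, so this is a simplification of the rules rather than FIDE's actual procedure.

```python
def fide_k_factor(rating, total_rated_games, age, ever_rated_2400):
    """Select K under the current FIDE ranges (simplified)."""
    if total_rated_games < 30 or (age < 18 and rating < 2300):
        return 40
    if ever_rated_2400:
        return 10  # remains 10 permanently once reached
    return 20

print(fide_k_factor(2150, total_rated_games=120, age=25, ever_rated_2400=False))  # 20
print(fide_k_factor(2450, total_rated_games=300, age=30, ever_rated_2400=True))   # 10
```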
FIDE used the following ranges before July 2014:
The gradation of the K-factor reduces rating change at the top end of the rating range, reducing the possibility for rapid rise or fall of rating for those with a rating high enough to reach a low K-factor.
In theory, this might apply equally to online chess players and over-the-board players, since it is more difficult for all players to raise their rating after their rating has become high and their K-factor consequently reduced. However, when playing online, 2800+ players can more easily raise their rating by simply selecting opponents with high ratings – on the ICC playing site, a grandmaster may play a string of different opponents who are all rated over 2700. In over-the-board events, it would only be in very high level all-play-all events that a player would be able to engage that number of 2700+ opponents. In a normal, open, Swiss-paired chess tournament, frequently there would be many opponents rated less than 2500, reducing the ratings gains possible from a single contest for a high-rated player.
If we assume that the game results are binary, that is, only a win or a loss can be observed, the problem can be addressed via logistic regression, where the game results are dependent variables, the players' ratings are independent variables, and the model relating both is probabilistic: the probability of player \mathsf{A} winning the game is modeled as

\Pr\{\mathsf{A}~\textrm{wins}\} = \sigma(r_{\mathsf{A,B}}), \qquad \sigma(r) = \frac{1}{1 + 10^{-r/s}},

where

r_{\mathsf{A,B}} = R_{\mathsf{A}} - R_{\mathsf{B}}

denotes the difference of the players' ratings, and we use a scaling factor s = 400. By the law of total probability,

\Pr\{\mathsf{B}~\textrm{wins}\} = 1 - \Pr\{\mathsf{A}~\textrm{wins}\} = \sigma(-r_{\mathsf{A,B}}).
The log loss is then calculated as

\ell =\begin{cases} -\log \sigma(r_{\mathsf{A,B}}) & \textrm{if}~ \mathsf{A}~\textrm{wins},\\ -\log \sigma(-r_{\mathsf{A,B}}) & \textrm{if}~ \mathsf{B}~\textrm{wins}, \end{cases}

and, using stochastic gradient descent, the log loss is minimized as follows:

R_{\mathsf{A}} \leftarrow R_{\mathsf{A}} - \eta\frac{\textrm{d}\ell}{\textrm{d}R_{\mathsf{A}}},

R_{\mathsf{B}} \leftarrow R_{\mathsf{B}} - \eta\frac{\textrm{d}\ell}{\textrm{d}R_{\mathsf{B}}},

where \eta is the adaptation step.

Since \frac{\textrm{d}}{\textrm{d}r}\log\sigma(r) = \frac{\log 10}{s}\,\sigma(-r), \frac{\textrm{d}}{\textrm{d}r}\log\sigma(-r) = -\frac{\log 10}{s}\,\sigma(r), and \frac{\textrm{d}r_{\mathsf{A,B}}}{\textrm{d}R_{\mathsf{A}}} = 1, the adaptation is then written as follows:

R_{\mathsf{A}} \leftarrow R_{\mathsf{A}} + \eta\frac{\log 10}{s}\big(S_{\mathsf{A}} - \sigma(r_{\mathsf{A,B}})\big),

which may be compactly written as

R_{\mathsf{A}} \leftarrow R_{\mathsf{A}} + K(S_{\mathsf{A}} - E_{\mathsf{A}}),

where K = \eta\log(10)/s is the new adaptation step which absorbs \eta and s, S_{\mathsf{A}} = 1 if \mathsf{A} wins and S_{\mathsf{A}} = 0 if \mathsf{B} wins, and the expected score is given by E_{\mathsf{A}} = \sigma(r_{\mathsf{A,B}}).

Analogously, the update for the rating R_{\mathsf{B}} is

R_{\mathsf{B}} \leftarrow R_{\mathsf{B}} + K(S_{\mathsf{B}} - E_{\mathsf{B}}),

with S_{\mathsf{B}} = 1 - S_{\mathsf{A}} and E_{\mathsf{B}} = \sigma(-r_{\mathsf{A,B}}).
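A compact sketch of this stochastic-gradient view for win/loss games: with the learning rate and the scale s absorbed into K, a single SGD step reduces to the familiar Elo update. The value s = 400 and the variable names follow the derivation above rather than any particular library.

```python
def sigma(r, s=400.0):
    """Win probability of A as a function of the rating difference r = R_A - R_B."""
    return 1.0 / (1.0 + 10 ** (-r / s))

def sgd_elo_update(r_a, r_b, s_a, k=32.0):
    """One SGD step on the log loss; s_a = 1.0 if A wins, 0.0 if B wins.
    Returns the updated ratings (R_A, R_B)."""
    e_a = sigma(r_a - r_b)
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# The underdog wins: A gains about 29 points and B loses the same amount.
print(sgd_elo_update(1400, 1800, s_a=1.0))
```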
To address these difficulties, and to derive the Elo rating for ternary games (win, draw, loss), we will define an explicit probabilistic model of the outcomes. Next, we will minimize the log loss via stochastic gradient descent.
Since the loss, the draw, and the win are ordinal data, we should adopt a model which takes their ordinal nature into account, and we use the so-called adjacent categories model, which may be traced to Davidson's work:

\Pr\{\mathsf{A}~\textrm{wins}\} = \sigma(r_{\mathsf{A,B}};\kappa),

\Pr\{\mathsf{B}~\textrm{wins}\} = \sigma(-r_{\mathsf{A,B}};\kappa),

\Pr\{\mathsf{A}~\textrm{draws}\} = \kappa\sqrt{\sigma(r_{\mathsf{A,B}};\kappa)\,\sigma(-r_{\mathsf{A,B}};\kappa)},

where

\sigma(r;\kappa) = \frac{10^{r/s}}{\kappa + 10^{r/s} + 10^{-r/s}}

and \kappa \ge 0 is a parameter. Introduction of a free parameter should not be surprising, as we have three possible outcomes and thus an additional degree of freedom should appear in the model. In particular, with \kappa = 0 we recover the model underlying the logistic regression,

\Pr\{\mathsf{A}~\textrm{wins}\} = \frac{10^{r_{\mathsf{A,B}}/s}}{10^{r_{\mathsf{A,B}}/s} + 10^{-r_{\mathsf{A,B}}/s}},

where r_{\mathsf{A,B}} = R_{\mathsf{A}} - R_{\mathsf{B}}.
Using the ordinal model defined above, the log loss is now calculated as
\ell =\begin{cases} -\log \sigma(r_{\mathsf{A,B}};\kappa) & \textrm{if}~ \mathsf{A}~\textrm{wins},\\ -\log \sigma(-r_{\mathsf{A,B}};\kappa) & \textrm{if}~ \mathsf{B}~\textrm{wins},\\ -\log \kappa -\frac{1}{2}\log\sigma(r_{\mathsf{A,B}};\kappa) - \frac{1}{2}\log\sigma(-r_{\mathsf{A,B}};\kappa) & \textrm{if}~ \mathsf{A}~\textrm{draw}, \end{cases}
which may be compactly written as
\ell =-(S_{\mathsf{A}} +\frac{1}{2}D)\log \sigma(r_{\mathsf{A,B}};\kappa) -(S_{\mathsf{B}} +\frac{1}{2}D) \log \sigma(-r_{\mathsf{A,B}};\kappa) -D\log \kappa
where S_{\mathsf{A}} = 1 iff \mathsf{A} wins, S_{\mathsf{B}} = 1 iff \mathsf{B} wins, and D = 1 iff \mathsf{A} draws with \mathsf{B} (each variable being zero otherwise).
As before, we need the derivative of \log\sigma(r;\kappa), which is given by

\frac{\textrm{d}}{\textrm{d}r}\log\sigma(r;\kappa) = \frac{2\log 10}{s}\,g(-r;\kappa),

where

g(r;\kappa) = \frac{\kappa/2 + 10^{r/s}}{\kappa + 10^{r/s} + 10^{-r/s}}.

Thus, the derivative of the log loss with respect to the rating R_{\mathsf{A}} is given by

\frac{\textrm{d}\ell}{\textrm{d}R_{\mathsf{A}}} = -\frac{2\log 10}{s}\Big[\big(S_{\mathsf{A}}+\tfrac{1}{2}D\big) - g(r_{\mathsf{A,B}};\kappa)\Big],

where we used the relationships S_{\mathsf{A}} + S_{\mathsf{B}} + D = 1 and g(r;\kappa) + g(-r;\kappa) = 1.
Then, the stochastic gradient descent applied to minimize the log loss yields the following update for the rating R_{\mathsf{A}}:

R_{\mathsf{A}} \leftarrow R_{\mathsf{A}} + K\big(\hat{S}_{\mathsf{A}} - g(r_{\mathsf{A,B}};\kappa)\big),

where K = 2\eta\log(10)/s and \hat{S}_{\mathsf{A}} = S_{\mathsf{A}} + \tfrac{1}{2}D. Of course, \hat{S}_{\mathsf{A}} = 1 if \mathsf{A} wins, \hat{S}_{\mathsf{A}} = \tfrac{1}{2} if \mathsf{A} draws, and \hat{S}_{\mathsf{A}} = 0 if \mathsf{A} loses. To recognize the origin in the model proposed by Davidson, this update is called an Elo-Davidson rating.
The update for R_{\mathsf{B}} is derived in the same manner as

R_{\mathsf{B}} \leftarrow R_{\mathsf{B}} + K\big(\hat{S}_{\mathsf{B}} - g(-r_{\mathsf{A,B}};\kappa)\big),

where \hat{S}_{\mathsf{B}} = S_{\mathsf{B}} + \tfrac{1}{2}D.
We note that

g(r;\kappa) + g(-r;\kappa) = \frac{\kappa/2 + 10^{r/s} + \kappa/2 + 10^{-r/s}}{\kappa + 10^{r/s} + 10^{-r/s}} = 1,

and thus the rating update may be written as

R_{\mathsf{A}} \leftarrow R_{\mathsf{A}} + K\big(\hat{S}_{\mathsf{A}} - E_{\mathsf{A}}\big),

where E_{\mathsf{A}} = g(r_{\mathsf{A,B}};\kappa); we obtain practically the same equation as in the Elo rating, except that the expected score is given by g(r_{\mathsf{A,B}};\kappa) instead of \sigma(r_{\mathsf{A,B}}).
Of course, as noted above, for \kappa = 0 the probability of a draw is null and the model reduces to the win/loss case, so the Elo-Davidson rating behaves like the Elo rating. However, this is of no help to understand the case when draws are observed (we cannot use \kappa = 0, which would mean that the probability of a draw is null). On the other hand, if we use \kappa = 2, we have

E_{\mathsf{A}} = g(r_{\mathsf{A,B}};2) = \frac{1 + 10^{r_{\mathsf{A,B}}/s}}{2 + 10^{r_{\mathsf{A,B}}/s} + 10^{-r_{\mathsf{A,B}}/s}} = \frac{1}{1 + 10^{-r_{\mathsf{A,B}}/s}},

which means that, using \kappa = 2, the Elo-Davidson rating is exactly the same as the Elo rating.
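The Elo-Davidson update differs from the plain Elo update only through the expected score g(r; κ); the sketch below uses the parametrization written above with scale s = 400. The function names are illustrative, and κ is a free parameter that has to be chosen (or fitted to the observed draw frequency).

```python
def davidson_expected_score(r, kappa, s=400.0):
    """Expected score g(r; kappa) of player A, where r = R_A - R_B."""
    up, down = 10 ** (r / s), 10 ** (-r / s)
    return (kappa / 2 + up) / (kappa + up + down)

def elo_davidson_update(r_a, r_b, score_a, kappa=2.0, k=32.0):
    """score_a is 1 for a win, 0.5 for a draw, 0 for a loss (from A's point of view)."""
    e_a = davidson_expected_score(r_a - r_b, kappa)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# With kappa = 2 the expected score coincides with the usual Elo formula (~0.909 here).
print(round(davidson_expected_score(400, kappa=2.0), 3))
print(elo_davidson_update(1400, 1800, score_a=0.5))  # a draw raises the underdog's rating
```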
Beyond the chess world, concerns over players avoiding competitive play to protect their ratings caused Wizards of the Coast to abandon the Elo system for tournaments in favour of a system of their own devising called "Planeswalker Points".
Therefore, Elo ratings online still provide a useful mechanism for providing a rating based on the opponent's rating. Its overall credibility, however, needs to be seen in the context of at least the above two major issues described—engine abuse, and selective pairing of opponents.
The ICC has also recently introduced "auto-pairing" ratings which are based on random pairings, but with each win in a row ensuring a statistically much harder opponent who has also won x games in a row. With potentially hundreds of players involved, this creates some of the challenges of a major large Swiss event which is being fiercely contested, with round winners meeting round winners. This approach to pairing certainly maximizes the rating risk of the higher-rated participants, who may face very stiff opposition from players below 3000, for example. This is a separate rating in itself, maintained under the "1-minute" and "5-minute" rating categories. Maximum ratings achieved over 2500 are exceptionally rare.
[Figure: example rating updates. Player A starts with a 1400 rating and B with 1800 in a tournament using K = 32 (brown curves). The blue dash-dot line denotes the initial rating difference of 400 (1800 − 1400). The probability of B winning, the expected outcome, is 0.91 (intersection of the black solid curve and the blue line); if this happens, A's rating decreases by 3 (intersection of the brown solid curve and the blue line) to 1397 and B's increases by the same amount to 1803. Conversely, the probability of A winning, the unexpected outcome, is 0.09 (intersection of the black dotted curve and the blue line); if this happens, A's rating increases by 29 (intersection of the brown dotted curve and the blue line) to 1429 and B's decreases by the same amount to 1771.]
The term "inflation", applied to ratings, is meant to suggest that the level of playing strength demonstrated by the rated player is decreasing over time; conversely, "deflation" suggests that the level is advancing. For example, if there is inflation, a modern rating of 2500 means less than a historical rating of 2500, while the reverse is true if there is deflation. Using ratings to compare players between different eras is made more difficult when inflation or deflation are present. (See also Comparison of top chess players throughout history.)
Analyzing FIDE rating lists over time, Jeff Sonas suggests that inflation may have taken place since about 1985. Sonas looks at the highest-rated players, rather than all rated players, and acknowledges that the changes in the distribution of ratings could have been caused by an increase of the standard of play at the highest levels, but looks for other causes as well.
The number of people with ratings over 2700 has increased. Around 1979 there was only one active player (Anatoly Karpov) with a rating this high. In 1992 Viswanathan Anand was only the 8th player in chess history to reach the 2700 mark. This increased to 15 players by 1994. 33 players had a 2700+ rating in 2009 and 44 as of September 2012. Only 14 players have ever broken a rating of 2800.
One possible cause for this inflation was the rating floor, which for a long time was at 2200, and if a player dropped below this they were struck from the rating list. As a consequence, players at a skill level just below the floor would only be on the rating list if they were overrated, and this would cause them to feed points into the rating pool. In July 2000 the average rating of the top 100 was 2644. By July 2012 it had increased to 2703.
Using a strong chess engine to evaluate moves played in games between rated players, Regan and Haworth analyze sets of games from FIDE-rated tournaments, and draw the conclusion that there had been little or no inflation from 1976 to 2009.
In a pure Elo system, each game ends in an equal transaction of rating points. If the winner gains N rating points, the loser will drop by N rating points. This prevents points from entering or leaving the system when games are played and rated. However, players tend to enter the system as novices with a low rating and retire from the system as experienced players with a high rating. Therefore, in the long run a system with strictly equal transactions tends to result in rating deflation.
In 1995, the USCF acknowledged that several young scholastic players were improving faster than the rating system was able to track. As a result, established players with stable ratings started to lose rating points to the young and underrated players. Several of the older established players were frustrated over what they considered an unfair rating decline, and some even quit chess over it. (A Conversation with Mark Glickman, published in Chess Life, October 2006.)
Rating floors in the United States work by guaranteeing that a player will never drop below a certain limit. This also combats deflation, but the chairman of the USCF Ratings Committee has been critical of this method because it does not feed the extra points to the improving players. A possible motive for these rating floors is to combat sandbagging, i.e., deliberate lowering of ratings to be eligible for lower rating class sections and prizes.
For some ratings estimates, see Chess engine § Ratings.
College football used the Elo method as a portion of its Bowl Championship Series rating systems from 1998 to 2013 after which the BCS was replaced by the College Football Playoff. Jeff Sagarin of USA Today publishes team rankings for most American sports, which includes Elo system ratings for college football. The use of rating systems was effectively scrapped with the creation of the College Football Playoff in 2014.
In other sports, individuals maintain rankings based on the Elo algorithm. These are usually unofficial, not endorsed by the sport's governing body. The World Football Elo Ratings is an example of the method applied to men's football. In 2006, Elo ratings were adapted for Major League Baseball teams by Nate Silver, then of Baseball Prospectus. Based on this adaptation, both Silver and Baseball Prospectus also made Elo-based Monte Carlo simulations of the odds of whether teams will make the playoffs. In 2014, Beyond the Box Score, an SB Nation site, introduced an Elo ranking system for international baseball.
In tennis, the Elo-based Universal Tennis Rating (UTR) rates players on a global scale, regardless of age, gender, or nationality. It is the official rating system of major organizations such as the Intercollegiate Tennis Association and World TeamTennis and is frequently used in segments on the Tennis Channel. The algorithm analyzes more than 8 million match results from over 800,000 tennis players worldwide. On May 8, 2018, Rafael Nadal—having won 46 consecutive sets in clay court matches—had a near-perfect clay UTR of 16.42.
In pool, an Elo-based system called Fargo Rate is used to rank players in organized amateur and professional competitions.
One of the few Elo-based rankings endorsed by a sport's governing body is the FIFA Women's World Rankings, based on a simplified version of the Elo algorithm, which FIFA uses as its official ranking system for national teams in women's football. From the first ranking list after the 2018 FIFA World Cup, FIFA has also used Elo for their FIFA Men's World Rankings.
In 2015, Nate Silver, editor-in-chief of the statistical commentary website FiveThirtyEight, and Reuben Fischer-Baum produced Elo ratings for every National Basketball Association team and season through the 2014 season (Reuben Fischer-Baum and Nate Silver, "The Complete History of the NBA", FiveThirtyEight, May 21, 2015). In 2014, FiveThirtyEight created Elo-based ratings and win projections for the American professional National Football League.
The English Korfball Association rated teams based on Elo ratings, to determine handicaps for their cup competition for the 2011/12 season.
An Elo-based ranking of National Hockey League players has been developed. The hockey-Elo metric evaluates a player's overall two-way play: scoring and defense in both even-strength and power-play/penalty-kill situations.
Rugbyleagueratings.com uses the Elo rating system to rank international and club rugby league teams.
Hemaratings.com was started in 2017 and uses a Glicko-2 algorithm to rank individual Historical European martial arts fencers worldwide in different categories such as Longsword, Rapier, historical Sabre, and Sword & Buckler.
Few video games use the original Elo rating system. According to Lichess, an online chess server, the Elo system is outdated, with Glicko-2 now being used by many chess organizations. PlayerUnknown’s Battlegrounds is one of the few video games that utilizes the very first Elo system. In Guild Wars, Elo ratings are used to record guild rating gained and lost through guild-versus-guild battles. In 1998, an online gaming ladder called Clanbase was launched, which used the Elo scoring system to rank teams. The initial K-value was 30, but was changed to 5 in January 2007, then changed to 15 in July 2009. The site later went offline in 2013. A similar alternative site was launched in 2016 under the name Scrimbase, which also used the Elo scoring system for ranking teams. Since 2005, Golden Tee Live has rated players based on the Elo system. New players start at 2100, with top players rated over 3000.
Despite many video games using different systems for matchmaking, it is common for players of ranked video games to refer to all matchmaking ratings as Elo.
The Elo rating system has been used in biology for assessing male dominance hierarchies, and in automation and computer vision for fabric inspection.
Moreover, online judge sites also use Elo rating systems or their derivatives. For example, Topcoder uses a modified version based on the normal distribution, while Codeforces uses another version based on the logistic distribution.
The Elo rating system has been noted in dating apps, such as in the matchmaking app Tinder, which uses a variant of the Elo rating system.
The YouTuber Marques Brownlee and his team used the Elo rating system when they let people vote between digital photos taken with different smartphone models launched in 2022.
The Elo rating system has been used in U.S. revealed preference college rankings, such as those by the digital credential firm Parchment.
The Elo rating system has been adopted to evaluate AI models. In 2021, Anthropic utilized the Elo system for ranking AI models in their research. LMArena briefly employed the Elo rating system to rank AI models before transitioning to the Bradley–Terry model.